Tool Usage
Dataset: cybench_tools.parquet
In this example we visualize tool usage over a series of turns in a Cybench evaluation.
Data Preparation
For analysis we read a raw messages data frame from an eval log [1], set the tool call field to “(none)” for messages that made no tool calls, then filter down to just the fields we need for visualization:
from inspect_ai.analysis.beta import messages_df, MessageColumns, SampleSummary

# read messages from log
log = "<path-to-log>.eval"
df = messages_df(log, columns=SampleSummary + MessageColumns)

# mark messages with no tool calls
df.loc[df["tool_call_function"].isna(), "tool_call_function"] = "(none)"

# trim columns
tools_df = df[[
    "eval_id",
    "id",
    "order",
    "tool_call_function",
    "limit"
]]
Note that the trimming of columns is particularly important because Inspect Viz embeds datasets directly in the web pages that host them (so we want to minimize their size for page load performance and bandwidth usage).
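To make the fill-and-trim step concrete, here is a minimal sketch on a tiny hand-built frame. The rows, the sample ids, and the extra "content" column are made up for illustration; the real frame returned by messages_df() has many more columns, and only the five trimmed column names come from the example above.

```python
import pandas as pd

# Hypothetical stand-in for the frame returned by messages_df().
# All values below are invented for illustration.
df = pd.DataFrame({
    "eval_id": ["e1", "e1", "e1", "e1"],
    "id": ["sample-a", "sample-a", "sample-a", "sample-b"],
    "order": [1, 2, 3, 1],
    "tool_call_function": ["bash", None, "submit", None],
    "limit": [None, None, None, "token"],
    "content": ["ls", "thinking", "flag", "hello"],  # dropped by the trim
})

# Mark messages with no tool calls, exactly as in the example above.
df.loc[df["tool_call_function"].isna(), "tool_call_function"] = "(none)"

# Keep only the columns the plots need.
tools_df = df[["eval_id", "id", "order", "tool_call_function", "limit"]]

print(tools_df)
```

After the trim, the bulky "content" column is gone, which is the part that matters for the size of the embedded dataset.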
Trajectory Analysis
Here we use a cell() mark to visualize tool use over messages in each sample of an evaluation. We note any limit that ended the sample using a text() mark on the right side of the frame.
Code
from inspect_viz import Data
from inspect_viz.plot import plot, legend
from inspect_viz.mark import cell, text

tools = Data.from_dataframe(tools_df)
tools_domain = ["bash", "python", "submit", "(none)"]

plot(
    cell(
        tools,
        x="order",
        y="id",
        fill="tool_call_function",
    ),
    text(
        tools,
        text="limit",
        y="id",
        frame_anchor="right",
        font_size=8,
        font_weight=200,
        dx=50
    ),
    legend=legend("color", location="right"),
    color_domain=tools_domain,
    margin_top=0,
    margin_left=200,
    margin_right=100,
    x_ticks=list(range(0, 400, 50)),
    x_tick_size=4,
    x_label=None,
    y_label=None
)
1. The cell() mark shows tool calls.
2. The text() mark shows whether the sample terminated due to a limit.
3. Fix the color domain (color_domain) to our pre-set tool ordering.
4. Tweak the margins so the axis labels and text annotations appear correctly.
5. Reduce the number of tick marks on the x-axis (x_ticks).
6. No axis labels (x_label and y_label are None), as the axes are obvious from the tick marks and legend.
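Because the color domain is fixed by hand, it can be worth checking that it covers every tool name in the data before plotting. The sketch below is one way to do that in plain Python. The helper missing_from_domain is hypothetical (it is not part of inspect_viz), and the observed values are made up; how the plot renders values outside the domain is not covered here.

```python
# Hypothetical helper (not part of inspect_viz): list the values that occur
# in the data but are absent from a fixed color domain, in first-seen order.
def missing_from_domain(values, domain):
    missing = []
    for value in values:
        if value not in domain and value not in missing:
            missing.append(value)
    return missing


tools_domain = ["bash", "python", "submit", "(none)"]

# Made-up tool names standing in for the tool_call_function column.
observed = ["bash", "bash", "(none)", "python", "submit", "bash"]

unexpected = missing_from_domain(observed, tools_domain)
print(unexpected)
```

With a real data frame, the same check could be run on tools_df["tool_call_function"] before passing the domain to plot().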
Sample Drill Down
Here we stack two plots on top of each other: the original sample-level tool calling plot and a bar plot counting the messages that called each tool. Click any sample in the top plot to update the bottom plot so that it counts only the tool calls for that sample.
Code
from inspect_viz import Data, Selection
from inspect_viz.plot import plot, legend
from inspect_viz.mark import cell, text, bar_x
from inspect_viz.interactor import toggle_y
from inspect_viz.layout import vconcat
from inspect_viz.transform import count

tools = Data.from_dataframe(tools_df)
click = Selection.single()

vconcat(
    plot(
        cell(
            tools,
            filter_by=click,
            x="order",
            y="id",
            fill="tool_call_function",
        ),
        toggle_y(target=click),
        text(
            tools,
            filter_by=click,
            text="limit",
            y="id",
            frame_anchor="right",
            font_size=8,
            font_weight=200,
            dx=50
        ),
        legend=legend("color", location="right"),
        color_domain=tools_domain,
        margin_top=0,
        margin_bottom=0,
        margin_left=200,
        margin_right=100,
        x_ticks=list(range(0, 400, 50)),
        x_tick_size=4,
        y_domain="fixed",
        x_domain="fixed",
        x_label=None,
        y_label=None
    ),
    plot(
        bar_x(
            tools,
            filter_by=click,
            x=count(),
            y="tool_call_function",
            fill="tool_call_function",
        ),
        y_label=None,
        y_domain="fixed",
        color_domain=tools_domain[:-1],
        margin_left=50,
        height=170
    )
)
1. A Selection is used to filter the plot display (a “single” selection filters out all points not in the target selection).
2. Marks are filtered by the “click” selection (filter_by=click).
3. The toggle_y() interactor updates the “click” selection with the y-value (the sample) that has been clicked.
4. Fix the x and y domains (x_domain and y_domain set to "fixed") so that click selections don’t cause the axes to change.
5. The bar plot is also filtered by the “click” selection.
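As a rough mental model of what the bar plot shows, the sketch below reproduces the filter-then-count logic in plain Python. It is only an illustration: the rows are made up, tool_counts is a hypothetical helper, and treating "no sample selected" as "count everything" is an assumption about how an empty selection behaves, not a statement about the library's internals.

```python
from collections import Counter

# Made-up (sample id, tool) pairs standing in for rows of tools_df.
rows = [
    ("sample-a", "bash"),
    ("sample-a", "bash"),
    ("sample-a", "submit"),
    ("sample-b", "python"),
    ("sample-b", "(none)"),
]


def tool_counts(rows, selected_id=None):
    """Count messages per tool, optionally restricted to one sample."""
    return Counter(
        tool
        for sample_id, tool in rows
        if selected_id is None or sample_id == selected_id
    )


all_counts = tool_counts(rows)                   # nothing clicked
sample_a_counts = tool_counts(rows, "sample-a")  # "sample-a" clicked

print(all_counts)
print(sample_a_counts)
```

Clicking a sample in the top plot corresponds to passing its id as selected_id: the counts shrink to that sample's messages while the set of tool categories on the y-axis stays the same.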
Footnotes
1. The eval log read for this example is in the inspect-viz-example-logs repo.